Data

This is an attempt at using a non-linear measure of correlation to identify similar trend in a time series. The idea and code is heavily influenced by this article. Additional ideas (not yet implemented) are influenced by Rob Hyndman’s workshop at the RStudio::conf 2020.




Distance Correlation

Distance correlation is a type of correlation that can detect non-linear and non-monotonic correlations. I became aware of distance correlation from a comment to an article discussing problems with a another non-linear, non-monotonic correlation measure.




Example: Using Distance Correlation on a Stock Ticker

Let’s try this method on Apple stock indeces. First, we download Apple’s data from Yahoo Finance using the quantmod package:

## Load Suncor stock
cenovus <- getSymbols("CVE", src = 'yahoo', from = '2015-01-01', auto.assign = F) %>% 
  as.data.frame() %>% 
  rownames_to_column(var = "date") %>% 
  mutate(date = ymd(date)) %>% 
  select(date, CVE.Open)

cenovus %>% 
  slice(1:10) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = TRUE, font_size = 12)
date CVE.Open
2015-01-02 20.420000
2015-01-05 20.280001
2015-01-06 19.420000
2015-01-07 19.799999
2015-01-08 19.639999
2015-01-09 19.910000
2015-01-12 19.690001
2015-01-13 19.080000
2015-01-14 18.690001
2015-01-15 19.930000



Now let’s try to identify a pattern of interest, for instance, the rapid downspike between 2015-08-11 and 2014-09-01:

## Identify pattern
pattern <- cenovus %>% 
  filter(between(date, as.Date("2019-04-04"), as.Date("2019-05-06")))


(cenovus %>% 
  ggplot(aes(x = date, y = CVE.Open)) +
  geom_line() +
  geom_line(data = pattern, color = "red") +
  labs(x = "") +
  scale_x_date(breaks = "6 months", date_labels = "%Y-%m" )) %>% 
  ggplotly()



Let’s calculate the distance correlation measure and plot the top 3:

## Calculate distance
n <- nrow(pattern)
df2 <- cenovus %>%
    anti_join(pattern, by = "date") %>% ##remove pattern
    group_by(grp = as.integer(gl(n(), n, n()))) %>% ##create grouping factors
    filter(n() == n) %>% ##removes windows of unequal number (usual last)
    mutate(cor1 = DCOR(scale(pattern$CVE.Open), scale(CVE.Open))$dCor) %>% 
    ungroup()


## Plot
df3 <- df2 %>%
  mutate(rank1 = dense_rank(desc(cor1)))


(ggplot() +
  ## Plot full series
  geom_line(data = cenovus, aes(x = date, y = CVE.Open), color = "grey") +
  ## Plot pattern
  geom_line(data = pattern, aes(x = date, y = CVE.Open), color = "red", size = 0.8) +
  ## Plot closest pattern by corr.x
  geom_line(data = filter(df3, rank1<=3), aes(x = date, y = CVE.Open, group = grp), color = "purple") +
  scale_x_date(breaks = "12 months") +
  labs(x = "",
       title = "Cenovus. Red: Pattern, Purple: top 3 matches")) %>% 
  ggplotly()


The top 3 correlations are pretty close in shape. Notice that this approach cannot distinguish symmetry which is quite important for many applications. See some of the untested improvements ideas below that may help address this.




Improvements:

Below are some improvements ideas based on creating several statistics and using principal components to reduce the dimensionality:

df2 <- df %>%
    mutate(cor1      = Hmisc::spearman2(scale(pattern$var), scale(var))[[1]]) %>% 
    mutate(cor4      = Hmisc::hoeffd(scale(pattern$var), scale(var))$P[[1,2]]) %>% 
    mutate(entropy   = entropy(var) - entropy(pattern$var)) %>% 
    mutate(flat      = longest_flat_spot(var) - longest_flat_spot(pattern$var)) %>% 
    mutate(cross     = n_crossing_points(var) - n_crossing_points(pattern$var)) %>% 
    mutate(lumpiness = lumpiness(var) - lumpiness(pattern$var)) %>% 
    mutate(hurst     = hurst(var) - hurst(pattern$var)) %>% 
    mutate(nonlin    = nonlinearity(var) - nonlinearity(pattern$var)) %>% 
  
  
## Principal component or similar
df.pc <- df2 %>% 
  select(-date, -var, -grp, -lumpiness)

prc <- prcomp(df.pc, scale = TRUE)

df2 <- df2 %>% 
  bind_cols(as.data.frame(prc$x))




Last updated: January 29, 2021